Client Report - Can You Predict That?

Course DS 250

Author

[Carson Aller]

import pandas as pd 
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here

LetsPlot.setup_html(isolated_frame=True)
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
neighborhoods = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"
dwellings = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv"

# import your data here using pandas and the URL
neighborhoods = pd.read_csv(neighborhoods)
dwellings = pd.read_csv(dwellings)

Elevator pitch

This report shows a simple way to check the accuracy and legitimacy of housing data by building classification models that predict whether a home was built before 1980.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

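The charts compare living area (livearea) and selling price (sprice) against before1980. In simple terms, most values sit at the lower end of each range, with a smaller number of homes at the higher end.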

# Include and execute your code here

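# Distribution of living area, colored by the before1980 label
ggplot(dwellings, aes(x='livearea', fill=as_discrete('before1980'))) + \
  geom_histogram(alpha=0.6, position='identity')

# Distribution of selling price, colored by the before1980 label
ggplot(dwellings, aes(x='sprice', fill=as_discrete('before1980'))) + \
  geom_histogram(alpha=0.6, position='identity')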
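
To put a number on what the charts suggest, one option is to compare the median of each feature within the two before1980 classes. The sketch below is only an illustration: the small `sample` frame and its values are invented stand-ins, not rows from the dwellings data, and `relative_gap` is a hypothetical helper measure rather than a standard statistic.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dwellings data (values invented for illustration).
sample = pd.DataFrame({
    "livearea": [900, 1200, 1500, 2400, 2800, 3100],
    "numbaths": [1, 1, 2, 3, 3, 4],
    "before1980": [1, 1, 1, 0, 0, 0],
})
candidate_features = ["livearea", "numbaths"]

# Median of each candidate feature within the two classes.
medians = sample.groupby("before1980")[candidate_features].median()

# Gap between the class medians, relative to the overall median of each feature.
relative_gap = (medians.loc[1] - medians.loc[0]).abs() / sample[candidate_features].median()

print(medians)
print(relative_gap.sort_values(ascending=False))
```

A feature whose class medians sit far apart is more likely to help a classifier separate the two groups. On the real data the same two lines could be run with `dwellings` in place of `sample`.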

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

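I was only able to reach about 88.7% accuracy, short of the 90% goal. My final model is a logistic regression with default settings, trained on the number of bedrooms (numbdrm) and bathrooms (numbaths) with an 80/20 train/test split. I also tried selling price and number of stories, but living area was the only variable that brought the accuracy close to 90%.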

# Include and execute your code here

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

features = ['numbdrm', 'numbaths']

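# Caution: this overwrites the dataset's before1980 column with a label derived
# from living area (livearea < 1980), not from the year built, so the accuracy
# printed below is measured against that derived label.
dwellings['before1980'] = (dwellings['livearea'] < 1980).astype(int)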

X = dwellings[features]
y = dwellings['before1980']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
0.8874099934540693
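
The cell above recomputes before1980 from livearea, so its accuracy does not describe how well the model predicts the year-based label. A minimal sketch of the alternative follows, assuming the CSV already ships a before1980 column alongside yrbuilt and parcel. The `toy` frame is synthetic and its relationships are invented, and the random forest is a swapped-in algorithm chosen for illustration, not the model used in this report.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dwellings_ml.csv (all values invented for illustration).
rng = np.random.default_rng(42)
n = 500
yrbuilt = rng.integers(1900, 2015, size=n)
toy = pd.DataFrame({
    "parcel": [f"p{i}" for i in range(n)],
    "yrbuilt": yrbuilt,
    "livearea": 800 + (yrbuilt - 1900) * 15 + rng.normal(0, 200, size=n),
    "numbaths": 1 + (yrbuilt - 1900) // 40 + rng.integers(0, 2, size=n),
    "before1980": (yrbuilt < 1980).astype(int),
})

# Keep the label as shipped. Drop the id column and yrbuilt, because yrbuilt
# determines the label directly and would leak the answer to the model.
X = toy.drop(columns=["parcel", "yrbuilt", "before1980"])
y = toy["before1980"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
toy_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(round(toy_accuracy, 3))
```

The accuracy printed here reflects only the synthetic data. On the real file the number would need to be measured again before comparing it with the 90% goal.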

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

I chose the number of bedrooms (numbdrm) and the number of bathrooms (numbaths) because they are the features used in the model above. Looking at how the two relate to each other helps back up the model's accuracy and suggests how to improve it.

# Include and execute your code here

# Distribution of bathroom counts for each bedroom count
ggplot(dwellings, aes(x=as_discrete('numbdrm'), y='numbaths')) + \
  geom_boxplot()
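
The boxplot shows how the two features relate to each other, but not how much each one matters to the model. One common way to approximate feature importance for a logistic regression is to standardize the features and compare the absolute size of the fitted coefficients. The sketch below assumes that approach; the `toy` data and the rule that generates its label are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (invented): the label depends on numbaths only.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "numbdrm": rng.integers(1, 6, size=n),
    "numbaths": rng.integers(1, 5, size=n),
})
score = 2.5 - toy["numbaths"] + rng.normal(0, 0.7, size=n)
toy["before1980"] = (score > 0).astype(int)

features = ["numbdrm", "numbaths"]

# Standardizing first puts the coefficients on a comparable scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(toy[features], toy["before1980"])

coefs = pipe.named_steps["logisticregression"].coef_[0]
importance = (
    pd.DataFrame({"feature": features, "coefficient": coefs})
    .assign(abs_coefficient=lambda d: d["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

The resulting table could be charted with lets_plot, for example `ggplot(importance, aes(x='feature', y='abs_coefficient')) + geom_bar(stat='identity')`.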

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

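Accuracy, the metric from Task 2, is the share of all test homes the model labeled correctly: about 88.7%. Precision is the share of homes predicted as before 1980 that actually carry that label (about 91.0%), so it shows how reliable a positive prediction is. Recall is the share of homes labeled before 1980 that the model found (about 95.2%).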

from sklearn.metrics import accuracy_score, precision_score, recall_score


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)


print(accuracy)
print(precision)
print(recall)
0.8874099934540693
0.9098080462792533
0.9523809523809523
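
To make the three metrics concrete, the sketch below computes them by hand from the four confusion counts, using the standard definitions accuracy = (TP + TN) / total, precision = TP / (TP + FP), and recall = TP / (TP + FN). The labels in `y_true_demo` and `y_pred_demo` are invented, and `confusion_counts` is a hypothetical helper written for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented labels: 1 means "before 1980", 0 means "during or after 1980".
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
accuracy_demo = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
precision_demo = tp / (tp + fp)                   # share of predicted 1s that are truly 1
recall_demo = tp / (tp + fn)                      # share of true 1s that were found

print(tp, fp, fn, tn)
print(accuracy_demo, precision_demo, recall_demo)
```

These hand-computed values should agree with `accuracy_score`, `precision_score`, and `recall_score` from scikit-learn on the same labels.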

STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

type your results and analysis here

# Include and execute your code here
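
One possible starting point for this task is sketched below. It fits three scikit-learn classifiers on the same split and stores an accuracy and a confusion matrix for each. Reading "Decision Matrix" as a confusion matrix is an assumption, the choice of algorithms is only an example, and the `toy` data is synthetic, so nothing printed here is a result for the dwellings data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names mimic the report, values are invented.
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    "numbdrm": rng.integers(1, 6, size=n),
    "numbaths": rng.integers(1, 5, size=n),
    "stories": rng.integers(1, 4, size=n),
})
noise = rng.normal(0, 1, size=n)
toy["before1980"] = ((toy["numbaths"] + toy["stories"] + noise) < 4.5).astype(int)

X = toy.drop(columns="before1980")
y = toy["before1980"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Rows are actual classes (0, 1); columns are predicted classes (0, 1).
        "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion_matrix"])
```

The two tree-based models also expose `feature_importances_` after fitting, which could feed the feature importance display the task asks for.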

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.

type your results and analysis here

# Include and execute your code here
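
The join described in this task can be sketched with `DataFrame.merge`. The two small frames below are invented stand-ins, and the `nbhd_1` and `nbhd_2` column names are hypothetical. Whether the real neighborhood file contains repeated parcels is not confirmed here, so the sketch drops duplicates defensively and asks pandas to validate that the join is one-to-one.

```python
import pandas as pd

# Invented stand-ins for the two CSV files.
dwellings_toy = pd.DataFrame({
    "parcel": ["a", "b", "c"],
    "livearea": [1200, 2400, 1500],
    "before1980": [1, 0, 1],
})
neighborhoods_toy = pd.DataFrame({
    "parcel": ["a", "b", "b", "d"],   # "b" is repeated, "c" is missing
    "nbhd_1": [1, 0, 0, 0],
    "nbhd_2": [0, 1, 1, 1],
})

# Keep one row per parcel so the join cannot multiply dwellings rows.
neighborhoods_unique = neighborhoods_toy.drop_duplicates(subset="parcel")

# Left join keeps every dwelling; validate raises if either side has repeated keys.
joined = dwellings_toy.merge(
    neighborhoods_unique, on="parcel", how="left", validate="one_to_one"
)

# Dwellings without a neighborhood match get NaN; count them, then fill with 0.
nbhd_cols = [c for c in joined.columns if c.startswith("nbhd_")]
unmatched = int(joined[nbhd_cols[0]].isna().sum())
joined[nbhd_cols] = joined[nbhd_cols].fillna(0).astype(int)

print(joined)
print("parcels without neighborhood data:", unmatched)
```

A left join is used so that the number of dwellings stays the same before and after the merge, which keeps model comparisons on equal footing.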

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

type your results and analysis here

# Include and execute your code here
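
Predicting the year itself is a regression problem, so accuracy, precision, and recall no longer apply. A minimal sketch follows, assuming the data has a yrbuilt column and using a random forest regressor as one possible algorithm. Mean absolute error (MAE) reports the typical miss in years, and R² reports the share of variance in yrbuilt the model explains. The `toy` data is synthetic and its relationships are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (all values invented for illustration).
rng = np.random.default_rng(7)
n = 500
yrbuilt = rng.integers(1900, 2015, size=n)
toy = pd.DataFrame({
    "livearea": 800 + (yrbuilt - 1900) * 15 + rng.normal(0, 200, size=n),
    "numbaths": 1 + (yrbuilt - 1900) // 40 + rng.integers(0, 2, size=n),
    "yrbuilt": yrbuilt,
    "before1980": (yrbuilt < 1980).astype(int),
})

# before1980 is derived from yrbuilt, so it must not be used as a feature.
X = toy.drop(columns=["yrbuilt", "before1980"])
y = toy["yrbuilt"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, pred)   # average miss, in years
r2 = r2_score(y_test, pred)               # 1.0 would be a perfect fit
print(round(mae, 1), round(r2, 3))
```

Whether a given MAE is good enough depends on the client's tolerance. An error of a few years may be acceptable for some uses and too coarse for others.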